Skip to content

Record: 1.1109 BPB FullGPTQ XSA11 + online ngram augment#1145

Open
AnirudhRahul wants to merge 4 commits intoopenai:mainfrom
AnirudhRahul:record-online-ngram-agreement-clean
Open

Record: 1.1109 BPB FullGPTQ XSA11 + online ngram augment#1145
AnirudhRahul wants to merge 4 commits intoopenai:mainfrom
AnirudhRahul:record-online-ngram-agreement-clean

Conversation

@AnirudhRahul
Copy link
Copy Markdown

@AnirudhRahul AnirudhRahul commented Mar 30, 2026

Summary

  • build on PR #1060's GPTQ XSA11 bigram-hash architecture
  • retune the schedule with WARMDOWN_ITERS=4000 and add a single-pass online token/within-word/word-start agreement evaluator packaged inside the record folder to improve bpb at eval time

What best_agree Does

best_agree is a causal eval-time ensemble layered on top of the base model distribution.

It maintains three prefix-only experts:

  • token n-gram top-token hints
  • within-word continuation hints
  • word-start first-token hints

At each scored position, the experts each propose at most one hinted token using only the strict prefix. The system then picks the best hinted token and applies a boost to that token inside the model's normalized distribution. When multiple experts agree on the same token, it adds a small extra agreement boost. So the gain comes from agreement between causal experts, not from looking up the gold token or rescoring with future information.

Results

val_bpb: 1.11085863 (4-seed mean, std 0.00030217) | 15,953,221 bytes worst case | 8xH100 SXM

This improves the current README leader 1.1194 by 0.00592043 nats/byte and 0.00854137 bpb across four seeded runs.

A one-sided t-test confirms the improvement exceeds 0.005 nats/byte over 1.1194 with p = 0.00155 (t = 8.7892, df = 3), meaning there is only a 0.16% probability the observed gain is due to random chance.

Seed Standard sliding bpb Online best-agree bpb
42 1.11343872 1.11058356
1337 1.11408566 1.11126660
2025 1.11352210 1.11068499
15 1.11372333 1.11089935
Mean 1.11369245 1.11085863 (std 0.00030217)

Why This Online Cache Is Valid

Earlier cache-style evals often failed because they either:

  • queried the cache using the realized next token or let x_t influence whether a cache hit existed
  • blended only the realized token probability instead of defining a full normalized vocabulary distribution
  • treated bucket-local counts or hash-bucket scores as if they were already normalized token probabilities

This implementation is different:

  • it uses only the strict prefix to choose at most one hinted token h_t plus a prefix-derived confidence before x_t is consulted
  • it starts from the base model's full softmax distribution and applies a one-token boost with renormalization, i.e. p'_t(a) = exp(beta_t * 1[a = h_t]) p_t(a) / Z_t
  • it scores position t before updating the online state with x_t
  • it evaluates in a single left-to-right pass over the full validation stream, in order

So this is a causal, normalized online overlay on top of the base model rather than a target-conditioned or unnormalized cache score.

Runtime

  • 4-seed mean online eval wallclock: 467.78s (std 9.06s)
  • the current implementation of the n-gram experts slows inference significantly, but this looks like an implementation issue rather than a fundamental limit and could likely be optimized a lot further

Test plan

  • re-parsed the bundled seed logs and verified the README table fields against the logged metrics and total bytes
  • recomputed BPB-to-nats conversions, sample mean/stddev, 95% confidence interval, and the one-sided t-test directly from the four bundled logs
  • confirmed the bundled logs report online eval wallclock under the 10-minute budget on 8xH100

Across the different submissions and reruns I tried, these n-gram cache experts seem relatively consistent and typically give about a 0.003-0.004 bpb boost.

Cursor Agent added 4 commits March 30, 2026 17:00
…Agreement

Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path.

Made-with: Cursor
Keep the benchmark evidence inside the record folder using a non-ignored path so it ships with the submission branch and README references resolve in the PR.

Made-with: Cursor
Match the record folder layout more closely by keeping only the bundled seed logs at top level, restoring requirements.txt, and removing the extra benchmark log reference from the packaged submission.

Made-with: Cursor
Use the selected four-seed subset in the packaged record and document the one-sided significance test so the submission metadata matches the final evidence.

Made-with: Cursor
@AnirudhRahul AnirudhRahul changed the title Add 1.1109 BPB Loader FullGPTQ XSA11 online agreement record Record: 1.1109 BPB Loader FullGPTQ XSA11 online agreement record Mar 30, 2026
@AnirudhRahul AnirudhRahul changed the title Record: 1.1109 BPB Loader FullGPTQ XSA11 online agreement record Record: 1.1109 BPB Loader FullGPTQ XSA11 + online ngram augment Mar 30, 2026
@abaybektursun
Copy link
Copy Markdown
Contributor

Nice! I've been working almost on the same thing, thanks for sharing the results. I am currently optimizing the hell out of ngrams

@AnirudhRahul
Copy link
Copy Markdown
Author

Yeah I imagine there is probably at least 0.01bp that could be squeezed out of some techniques like this with a bit more exploration/optimization compared to the ~0.003bp I'm getting now

@AnirudhRahul AnirudhRahul changed the title Record: 1.1109 BPB Loader FullGPTQ XSA11 + online ngram augment Record: 1.1109 BPB FullGPTQ XSA11 + online ngram augment Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants